With a few exceptions, work in offline reinforcement learning (RL) has so far assumed that there is no confounding. In a classical regression setting, confounders introduce omitted variable bias and inhibit the identification of causal effects. In offline RL, they prevent the identification of a policy's value, and therefore make it impossible to perform policy improvement. Using conventional methods in offline RL in the presence of confounding can therefore not only lead to poor decisions and poor policies, but can also have disastrous effects in applications such as healthcare and education. We provide approaches for both off-policy evaluation (OPE) and local policy optimization in the settings of i.i.d. and global confounders. Theoretical and empirical results confirm the validity and viability of these methods.
We consider the stochastic linear contextual bandit problem with high-dimensional features. We analyze the Thompson sampling (TS) algorithm, using special classes of sparsity-inducing priors (e.g., spike-and-slab priors) to model the unknown parameter, and provide a nearly optimal upper bound on the expected cumulative regret. To the best of our knowledge, this is the first work that provides theoretical guarantees for Thompson sampling in high-dimensional and sparse contextual bandits. For faster computation, we use a spike-and-slab prior to model the unknown parameter and variational inference instead of MCMC to approximate the posterior distribution. Extensive simulations demonstrate improved performance of our proposed algorithm over existing ones.
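As a rough illustration of the mechanics described above (not the authors' exact algorithm), the sketch below pairs a standard coordinate-ascent variational approximation for spike-and-slab linear regression with a Thompson-sampling arm selection. The function names, hyperparameter defaults, and the particular variational updates are illustrative assumptions.

```python
import numpy as np

def spike_slab_vi(X, y, sigma2=1.0, tau2=1.0, pi=0.1, n_iters=50):
    """Coordinate-ascent mean-field VI for spike-and-slab linear regression.

    The posterior over each coefficient j is approximated by a mixture of a
    point mass at 0 (prob. 1 - q[j]) and a Gaussian slab N(mu[j], s2[j]) (prob. q[j]).
    """
    n, d = X.shape
    xtx = np.einsum('ij,ij->j', X, X)              # per-coordinate X_j^T X_j
    mu = np.zeros(d)
    q = np.full(d, pi)
    s2 = sigma2 / (xtx + sigma2 / tau2)
    r = X @ (q * mu)                               # current fitted values
    logit_pi = np.log(pi / (1.0 - pi))
    for _ in range(n_iters):
        for j in range(d):
            r -= X[:, j] * (q[j] * mu[j])          # remove coordinate j's contribution
            mu[j] = (s2[j] / sigma2) * X[:, j] @ (y - r)
            logit_q = logit_pi + 0.5 * np.log(s2[j] / tau2) + mu[j] ** 2 / (2 * s2[j])
            q[j] = 1.0 / (1.0 + np.exp(-logit_q))
            r += X[:, j] * (q[j] * mu[j])          # add the updated contribution back
    return mu, s2, q

def ts_select(arms, mu, s2, q, rng):
    """One Thompson-sampling step: draw a sparse parameter from the variational
    posterior and play the arm with the largest predicted reward."""
    z = rng.random(len(mu)) < q                    # sample inclusion indicators
    theta = z * rng.normal(mu, np.sqrt(s2))        # spike (0) or slab draw per coordinate
    return int(np.argmax(arms @ theta))
```

In a bandit loop one would refit (or warm-start) the variational parameters on the data collected so far and call `ts_select` on the current arm set at each round.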
Recently, Robey et al. proposed a notion of probabilistic robustness, which, at a high level, requires a classifier to be robust to most but not all perturbations. They show that for certain hypothesis classes where proper learning under worst-case robustness is \textit{not} possible, proper learning under probabilistic robustness \textit{is} possible with sample complexity exponentially smaller than in the worst-case robustness setting. This motivates the question of whether proper learning under probabilistic robustness is always possible. In this paper, we show that this is \textit{not} the case. We exhibit examples of hypothesis classes $\mathcal{H}$ with finite VC dimension that are \textit{not} probabilistically robustly PAC learnable with \textit{any} proper learning rule. However, if we compare the output of the learner to the best hypothesis for a slightly \textit{stronger} level of probabilistic robustness, we show that not only is proper learning \textit{always} possible, but it is possible via empirical risk minimization.
In this paper, we study a sequential decision-making problem, called Adaptive Sampling for Discovery (ASD). Starting with a large unlabeled dataset, algorithms for ASD adaptively label the points with the goal of maximizing the sum of responses. This problem has wide applications to real-world discovery problems, for example, drug discovery aided by machine learning models. ASD algorithms face the well-known exploration-exploitation dilemma: the algorithm needs to choose points that yield information to improve model estimates, but it also needs to exploit the model. We rigorously formulate the problem and propose a general information-directed sampling (IDS) algorithm. We provide theoretical guarantees for the performance of IDS in linear, graph, and low-rank models. The benefits of IDS are shown in both simulation experiments and real-data experiments for discovering chemical reaction conditions.
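The abstract does not spell out the sampling rule, but the usual information-directed sampling recipe trades expected regret against information gain. Below is a minimal sketch of one IDS step for a Bayesian linear model, using a common variance-based surrogate for the information gain; the helper name `ids_step` and all defaults are hypothetical, not the paper's implementation.

```python
import numpy as np

def ids_step(X_cand, mu, Sigma, sigma2=1.0, n_samples=200, rng=None):
    """One information-directed sampling step for a Bayesian linear model.

    X_cand    : (K, d) candidate points still unlabeled.
    mu, Sigma : posterior mean/covariance of the unknown parameter theta.
    Picks the candidate minimizing (expected regret)^2 / information gain,
    where the gain of point x is taken as 0.5*log(1 + x^T Sigma x / sigma2).
    """
    rng = np.random.default_rng() if rng is None else rng
    thetas = rng.multivariate_normal(mu, Sigma, size=n_samples)   # posterior samples
    values = thetas @ X_cand.T                                    # (n_samples, K) sampled mean responses
    best = values.max(axis=1)                                     # best achievable value per sample
    regret = (best[:, None] - values).mean(axis=0)                # expected shortfall per candidate
    var = np.einsum('kd,dj,kj->k', X_cand, Sigma, X_cand)         # predictive variance per candidate
    info = 0.5 * np.log1p(var / sigma2)                           # Gaussian information-gain surrogate
    ratio = regret ** 2 / np.maximum(info, 1e-12)                 # information ratio
    return int(np.argmin(ratio))
```

After labeling the selected point, one would update `mu` and `Sigma` with the standard Bayesian linear-regression posterior update and repeat.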
We propose an algorithm for non-stationary kernel bandits that does not require prior knowledge of the degree of non-stationarity. The algorithm follows randomized strategies obtained by solving an optimization problem that balances exploration and exploitation. It adapts to non-stationarity by restarting when a change in the reward function is detected. Our algorithm enjoys a tighter dynamic regret bound than previous work on the non-stationary kernel bandit setting. Moreover, when applied to the non-stationary linear bandit setting by using a linear kernel, our algorithm is nearly minimax optimal, resolving an open problem in the non-stationary linear bandit literature. We further extend our algorithm to use a neural network for dynamically adapting the feature mapping to the observed data, and we prove a dynamic regret bound for this extension using neural tangent kernel theory. We demonstrate empirically that our algorithm and the extension can adapt to varying degrees of non-stationarity.
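To make the restart-on-change-detection idea concrete, here is a minimal kernel-UCB loop that forgets its history when an observation deviates strongly from the posterior prediction. This only illustrates the restart mechanism; it is not the paper's randomized strategy or its change-detection test, and the environment interface, kernel choice, and thresholds are assumptions.

```python
import numpy as np

def rbf(A, B, ls=0.5):
    """Squared-exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def gp_posterior(X, y, X_star, noise=0.1):
    """Posterior mean/std of a GP with RBF kernel at the query points X_star."""
    K = rbf(X, X) + noise * np.eye(len(X))
    K_s = rbf(X_star, X)
    mean = K_s @ np.linalg.solve(K, y)
    var = np.clip(1.0 - np.einsum('ij,ij->i', K_s @ np.linalg.inv(K), K_s), 1e-12, None)
    return mean, np.sqrt(var)

def restart_kernel_ucb(env, arms, T, beta=2.0, thresh=3.0, noise=0.1, rng=None):
    """Kernel-UCB loop that restarts (drops its history) when the observed reward
    deviates from the posterior prediction by more than `thresh` standard deviations."""
    rng = np.random.default_rng() if rng is None else rng
    X_hist, y_hist, rewards = [], [], []
    for t in range(T):
        if X_hist:
            mean, std = gp_posterior(np.array(X_hist), np.array(y_hist), arms, noise)
            a = int(np.argmax(mean + beta * std))          # optimistic arm choice
        else:
            a, mean, std = int(rng.integers(len(arms))), None, None
        r = env(arms[a], t)                                 # reward function may drift over time
        if mean is not None and abs(r - mean[a]) > thresh * (std[a] + np.sqrt(noise)):
            X_hist, y_hist = [], []                         # change detected: restart
        X_hist.append(arms[a]); y_hist.append(r)
        rewards.append(r)
    return rewards
```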
Learning the parameters of linear time-invariant dynamical systems (LTIDS) is a problem of current interest. In many applications, one is interested in jointly learning the parameters of multiple related LTIDS, which remains unexplored to date. To that end, we develop a joint estimator for learning the transition matrices of LTIDS that share common basis matrices. Further, we establish finite-time error bounds that depend on the underlying sample size, dimension, number of tasks, and the spectral properties of the transition matrices. The results are obtained under mild regularity assumptions and showcase the gains from pooling information across LTIDS, relative to learning each system separately. We also study the impact of misspecifying the joint structure of the transition matrices and show that the established results are robust to moderate misspecification.
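A minimal sketch of the shared-basis model described above: each system's transition matrix is a mixture of common basis matrices, x_{t+1} = (sum_k beta[m,k] W_k) x_t + noise, fit here by alternating least squares over the per-system mixing weights and the shared basis. This illustrates the model only; it is not necessarily the paper's estimator, and the function name and defaults are assumptions.

```python
import numpy as np

def fit_shared_basis_ltids(trajs, K, n_iters=50, rng=None):
    """Alternating least squares for x_{t+1} = (sum_k beta[m, k] * W[k]) @ x_t.

    trajs : list of (T_m, d) state trajectories, one per system m.
    Returns the shared basis W (K, d, d) and mixing weights beta (M, K).
    Intended for small d and K; step 2 builds a dense least-squares system.
    """
    rng = np.random.default_rng() if rng is None else rng
    d, M = trajs[0].shape[1], len(trajs)
    W = rng.standard_normal((K, d, d)) / np.sqrt(d)
    beta = rng.standard_normal((M, K))
    pairs = [(tr[:-1], tr[1:]) for tr in trajs]            # (x_t, x_{t+1}) pairs per system
    for _ in range(n_iters):
        # Step 1: fix W, solve each system's mixing weights by least squares.
        for m, (X, Y) in enumerate(pairs):
            Phi = np.stack([(X @ Wk.T).ravel() for Wk in W], axis=1)   # (T*d, K)
            beta[m], *_ = np.linalg.lstsq(Phi, Y.ravel(), rcond=None)
        # Step 2: fix beta, solve jointly for the shared basis matrices.
        rows, targets = [], []
        for m, (X, Y) in enumerate(pairs):
            for x, y in zip(X, Y):
                # W_k @ x == np.kron(np.eye(d), x) @ W_k.ravel() (row-major flattening)
                blocks = [beta[m, k] * np.kron(np.eye(d), x) for k in range(K)]
                rows.append(np.hstack(blocks))
                targets.append(y)
        w_vec, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(targets), rcond=None)
        W = w_vec.reshape(K, d, d)
    return W, beta
```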
We study the problem of guaranteeing low regret in repeated games against an opponent with unknown membership in one of several classes. We add the constraint that our algorithm is non-exploitable, in that the opponent lacks an incentive to use an algorithm against which we cannot achieve rewards exceeding some "fair" value. Our solution is an expert algorithm (LAFF) that searches within a set of sub-algorithms that are optimal for each opponent class and uses a punishment policy upon detecting evidence of exploitation by the opponent. Using benchmarks that depend on the opponent class, we show that LAFF has sublinear regret uniformly over the possible opponents, except for exploitative ones, against which we guarantee that the opponent has linear regret. To our knowledge, this work is the first to provide guarantees for both regret and non-exploitability in multi-agent learning.
Curriculum learning (CL) is a commonly used machine learning training strategy. However, we still lack a clear theoretical understanding of its benefits. In this paper, we study the benefits of CL in the multitask linear regression problem under both structured and unstructured settings. For both settings, we derive minimax rates with an oracle that provides the optimal curriculum and without an oracle, where the agent must adaptively learn a good curriculum. Our results reveal that adaptive learning can be fundamentally harder than oracle learning in the unstructured setting, whereas it only introduces a small extra term in the structured setting. To connect theory with practice, we provide justification for a popular empirical method that selects the task with the highest local prediction gain, by comparing its guarantees with the minimax rates derived above.
We study online learning in adversarial communicating Markov decision processes. We give an algorithm that achieves $O(\sqrt{T})$ regret with respect to the best fixed deterministic policy in hindsight when the transitions are deterministic. We also prove a regret lower bound in this setting that is tight up to polynomial factors in the MDP parameters. Finally, we give an inefficient algorithm that achieves $O(\sqrt{T})$ regret in communicating MDPs (under an additional mild restriction on the transition dynamics).
In the causal bandit problem, the action set consists of interventions on the variables of a causal graph. Several researchers have recently studied this bandit problem and pointed out its practical applications. However, all existing works rely on the restrictive and impractical assumption that the learner is given full knowledge of the causal graph structure upfront. In this paper, we develop new causal bandit algorithms that do not require knowledge of the causal graph. Our algorithms work for causal trees, causal forests, and general causal graphs. The regret guarantees of our algorithms greatly improve upon those of standard multi-armed bandit (MAB) algorithms under mild conditions. Finally, we prove that our mild conditions are necessary: without them, one cannot do better than standard MAB algorithms in general.